The fully supported backends are CPU (AVX2 or better, ARM NEON or better) and CUDA. ROCm, Vulkan, and Metal are available but not actively maintained in this fork.
-ngl 999 offloads all layers to VRAM. Reduce the number if the model does not fit entirely in VRAM.

Open http://127.0.0.1:8080 to start chatting.
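Putting the flags together, a launch command might look like the sketch below. The binary and model path are placeholders; substitute the GGUF file you actually downloaded.

```shell
# Example server launch (model path is a placeholder):
# -ngl 999 offloads all layers to VRAM; lower it if the model does not fit.
./llama-server -m models/my-model.gguf -ngl 999 --host 127.0.0.1 --port 8080
```

Once the server is up, the chat UI is served at the host and port given above.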
FlashMLA (for DeepSeek models) requires an Ampere or newer NVIDIA GPU. For DeepSeek inference, also add -mla 3 -fa to your command.
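For DeepSeek models specifically, the same launch adds the flags above. A hedged sketch, with the model path again a placeholder:

```shell
# DeepSeek inference with FlashMLA (requires an Ampere or newer NVIDIA GPU):
# -mla 3 selects the MLA attention mode, -fa enables flash attention.
./llama-server -m models/deepseek.gguf -ngl 999 -mla 3 -fa
```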
For the best quantization quality, look for models with IQK quants (IQ4_KS, IQ5_K, IQ3_K) or Trellis quants (IQ2_KT, IQ3_KT) on HuggingFace. These are exclusive to ik_llama.cpp and outperform standard k-quants at the same bit-width.